Abstract

We investigate the space of all protein sequences. We combine the standard measures of similarity (SW, FASTA, BLAST) to associate with each sequence an exhaustive list of neighboring sequences. These lists induce a (weighted directed) graph whose vertices are the sequences. The weight of an edge connecting two sequences represents their degree of similarity. This graph encodes much of the fundamental properties of the sequence space. The idea that underlies our work is that interesting homologies among proteins can be deduced by transitivity.

If we eliminate all edges of weight below a certain significance threshold, the graph splits into connected components. These automatically induced sets of proteins are closely correlated with natural biological families. By performing this procedure at varying thresholds, in a stepwise manner, we obtain a hierarchical organization of the connected components, and thus of all known proteins.

The results show that this method successfully identifies many biological families. By varying the threshold of statistical significance, we discover the sub-families that make up known families of proteins. Likewise, this procedure exposes linkages between distinct protein families. Broadly speaking, protein families turn out to be connected in two distinct ways: (i) Through multi-domain proteins, each domain of which is associated with a distinct protein family, or (ii) Through proteins much of whose sequence is shared by the two families. The latter may be considered as linkers or ancestor proteins. Consequently, many interesting relations between protein families are revealed, and hierarchical organizations within protein families suggest themselves.

An interactive web site including the results of our analysis has been constructed, and is now accessible through http://protomap.cs.huji.ac.il

1 Introduction 

In recent years we have been witnessing a constant flow of new biological data. Large-scale sequencing projects throughout the world turn out new sequences, and create new challenges for researchers. Many sequences that are added to the databases are unannotated and await analysis. Currently, 12 complete genomes (of yeast, E. coli, and other bacteria) are available. About 35%-50% of their proteins have an unknown function [1].
[...] analysis necessarily starts by investigating the sequence. In the absence of [...] hydrophobicity, charge, secondary structure propensity and more. Sequence analysis [...] the sequence under study with the whole database, in search of close relatives. The properties of a new protein are extrapolated from those of its neighbors. Since the 70's [...] of comparing protein sequences efficiently [...] are very likely to have the same [...] fold, despite a low sequence similarity [10]. Nevertheless, one encounters [...] the database. Such [...] homology [...] sequences in the database.

The work we describe here concerns [...] clusters, as well as [...] correspondence with high-level features of proteins (family, function and [...]). [...] definition of a new metric on the space of all protein sequences [...] more sensitive than existing measures [...] exploration of all protein sequences, as discussed in [16, ...].






2 Methods 

This section contains a description of our computational procedure. The procedure was performed for the SWISSPROT database [17], release 33, with a total of 32205 proteins.

2.1 Defining the graph 

We represent the space of protein sequences by means of a directed graph. The vertices of this graph are the protein sequences. Edges between the vertices are weighted with weights that reflect the distance or dissimilarity between the corresponding sequences, i.e. high similarity translates to a small weight (or distance). To compute the weight of the directed edge from A to B, one compares A against all sequences in the SWISSPROT database, and obtains the distribution of its scores. The weight is taken as the expectation value of the similarity score between A and B, based on this distribution. This is a statistical estimate for the number of occurrences of the appropriate score in a random setup, assuming the existing amino acid compositions. When the similarity score is statistically insignificant, the corresponding edge is discarded (details below). In other words, an edge between sequences A and B indicates that the corresponding proteins are likely to be related.

This graph has been constructed using all currently known measures of similarity between protein sequences: Smith-Waterman (SW) [...], FASTA [...] and BLAST [...]. These methods are in daily use by biologists for comparing sequences against the databases. Though SW tends to give the best results on average, it is not uncommon that FASTA or BLAST are more informative [...]. Therefore we chose to incorporate all three methods into our graph, to achieve maximum sensitivity.

Searches may be strongly biased when the amino acid composition of the query sequence differs markedly from the overall average composition. A case in point are the effects of low complexity segments within the sequence. Therefore, we also consulted the results of BLAST following filtering of the query sequence, to exclude low complexity segments (using the SEG program [...]).

The following sections contain a detailed description of the procedure of assigning weights to edges. The procedure starts from creating the neighbors list for each sequence, in each of the three methods. A numerical normalization is applied to the three methods, so they are all on a comparable scale. Then, only statistically significant [...]. The weight of an edge is defined as the minimum associated to it by any of the three methods, to capture the apparently strongest relation.

² In order to better identify remote homologies, we used FASTA with the BLOSUM50 [...] SW and BLAST [with] BLOSUM62 scoring.
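
The rule that an edge takes the smallest value reported by any method can be sketched as follows. This is a minimal illustration, assuming the expectation values have already been normalized to a common scale and that a method which reported no hit is represented by None; the function name edge_weight and the dictionary layout are hypothetical and are not part of the original procedure.

```python
from typing import Dict, Optional


def edge_weight(evalues: Dict[str, Optional[float]]) -> Optional[float]:
    """Return the smallest expectation value reported by any method.

    `evalues` maps a method name (e.g. "SW", "FASTA", "BLAST") to its
    normalized expectation value, or to None when the method found no hit.
    The result is None when no method reported a value (no edge).
    """
    present = [e for e in evalues.values() if e is not None]
    return min(present) if present else None


# The strongest apparent relation (smallest expectation value) wins.
weight = edge_weight({"SW": 1e-12, "FASTA": 3e-9, "BLAST": None})
```

Taking the minimum means that a single method detecting a strong similarity is enough to define the edge weight, which matches the stated goal of capturing the apparently strongest relation.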



2.2 Scaling all methods to a single scale

It is relatively easy to compare between scores that a particular method assigns to different comparisons. However, how does one compare between scores that are assigned by different methods? We performed the following calculation: Pick any protein, carry out an exhaustive comparison against the whole database and consider the highest scores in each of the methods. Now plot these values and compare two methods at a time. These scores show a very strong linear relation in log-log scale (not shown). Therefore, introducing a (usually small) multiplicative factor, per each protein and per method, scales the three methods to a single reference line³.
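
One way to estimate such a multiplicative factor is sketched below. It is an illustration only, under two assumptions that the text does not state: the log-log relation has slope 1, and the factor is chosen by least squares in log space, which reduces to the geometric mean of the pairwise ratios. The name scale_factor and the pairing of values by database entry are hypothetical.

```python
import math
from typing import Sequence


def scale_factor(method_values: Sequence[float],
                 reference_values: Sequence[float]) -> float:
    """Estimate c such that c * method_values best matches reference_values.

    Both sequences hold positive values for the same database entries, for
    one query protein. With slope fixed at 1 in log-log scale, the
    least-squares estimate of log(c) is the mean of log(reference / method).
    """
    if not method_values or len(method_values) != len(reference_values):
        raise ValueError("need two non-empty sequences of equal length")
    log_ratios = [math.log(r / m)
                  for m, r in zip(method_values, reference_values)]
    return math.exp(sum(log_ratios) / len(log_ratios))


# Toy data: the reference values are exactly twice the method's values.
factor = scale_factor([1.0, 2.0, 4.0], [2.0, 4.0, 8.0])
```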

2.3 Defining the neighbors' list

It is, of course, very difficult to set a clear dividing line between true homologies and chance similarities. Expectation values below 10^-3 can be safely considered significant and those above 10 reflect almost pure chance similarity. However, the range within is difficult to characterize, and truly related proteins may have expectation values around 1. An overly strict threshold will miss important similarities within the twilight zone, whereas an excessively liberal criterion will create many false connections. The exact threshold for each method was set to best discern among related and unrelated proteins. Our choice is based on the overall distribution of distances over the entire protein space, as given by each of the three methods.

This is illustrated in Fig. 1, which shows the distribution of expectation values over the entire SWISSPROT database, for SW, FASTA, and BLAST. The graphs in Fig. 1 naturally suggest a threshold for each method. The distribution drawn in a log-log scale is nearly linear at low expectation values, but starts a rapid increase at a certain value. This value is set to be the threshold. The thresholds for SW, FASTA and BLAST are set at 0.1, 0.1 and 10^-3 respectively⁴. An edge from vertex A to vertex B is maintained only if a significant score is obtained on comparing the corresponding proteins. Namely, if either SW or FASTA yield an expectation value ≤ 0.1 or BLAST's expectation value is ≤ 10^-3.

³ The differences between FASTA and SW are mostly due to the different scoring matrices that are being used, and can be corrected by multiplying the original score by the relative entropy of the two matrices [22]. The differences between SW and BLAST may be due to approximations in estimating the parameters λ and K [23]. The underlying assumption in calculating these parameters is that the amino acid composition of the query sequence is close to the overall distribution. This assumption often fails, e.g., for low complexity segments. Moreover, these parameters are based on first order statistics of the sequence, the scoring matrix and the database. The corrections that are required to match SW and BLAST may be due to inaccurate approximations of the estimated parameters, or to higher order statistics of the sequence.

⁴ However, if filtering leads to a significant reduction in the number of high scoring hits, a more stringent threshold is set for BLAST, of 10^-4.
Figure 1: Averaged distribution of expectation values according to the three main algorithms for sequence comparison. a) BLAST b) FASTA c) SW. The distributions are plotted in a log-log scale. Note that the deviation from a straight line starts earlier in BLAST, around 10^-3, while in FASTA and SW it starts around 10^-1.



A major difference between BLAST and SW/FASTA is that BLAST charges no gap penalties. Consequently, BLAST tends to overestimate the statistical significance of alignments. We counter this behavior of BLAST by the above asymmetry in selecting the edges. While this property may help BLAST reveal significant similarities that the other methods miss (e.g. [19]), we have to beware of highly fragmentary alignments that cannot be considered biologically meaningful. Therefore, we ignore those BLAST scores that come from a large number of HSPs (high scoring pairs), whereas the MSP (maximal segment pair) is insignificant⁵.

Finally, even if the comparisons between proteins A and B fail to satisfy the previous criteria, the edge from A to B is maintained when all three methods yield an expectation value ≤ 1.
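
The edge-selection criteria of this section can be summarized in a short sketch. It is not the original implementation: the function name keep_edge is hypothetical, the BLAST threshold is read here as 10^-3, the comparisons are taken to be non-strict, and the separate treatment of fragmentary BLAST hits is left out.

```python
SW_FASTA_THRESHOLD = 0.1    # per-method threshold for SW and for FASTA
BLAST_THRESHOLD = 1e-3      # assumed reading of the BLAST threshold
CONSENSUS_THRESHOLD = 1.0   # fallback when all three methods agree


def keep_edge(sw: float, fasta: float, blast: float) -> bool:
    """Decide whether the directed edge A -> B is maintained.

    Arguments are the expectation values that SW, FASTA and BLAST assign
    to the comparison of A against B.
    """
    # Rule 1: any single method is significant by its own threshold.
    if sw <= SW_FASTA_THRESHOLD or fasta <= SW_FASTA_THRESHOLD:
        return True
    if blast <= BLAST_THRESHOLD:
        return True
    # Rule 2: no method is individually significant, but all three agree
    # on a weak signal.
    return all(e <= CONSENSUS_THRESHOLD for e in (sw, fasta, blast))
```

For example, keep_edge(0.5, 2.0, 0.5) is rejected because FASTA exceeds 1, while keep_edge(0.5, 0.5, 0.5) is accepted by the consensus rule.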



3 Exploring the connectivity 

We next turn our attention to the connected components of the graph we created. The transitive closure of the similarity relation among proteins splits the space of all protein sequences into connected components or clusters. These are proper subsets of the whole database wherein every two members are either directly or transitively related. These sets are maximal in this respect and cannot be expanded. Thus they offer a self-organized classification of all protein sequences in the database. These connected components can be expected to correlate with known biological families. The method and the results suggest that these connected components are indeed very informative and robust.

⁵ Specifically, the average and the standard deviation of the number of HSPs and of the score of the MSP were calculated for high scoring sequences (μ_HSP, σ_HSP, μ_MSP and σ_MSP respectively). Those hits that are based on a number of HSPs > μ_HSP + σ_HSP, with MSP score < μ_MSP - σ_MSP, and are not significant according to SW and FASTA, are ignored.
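
The filter on fragmentary BLAST hits (many HSPs, weak MSP) can be sketched as below. This is a rough illustration under assumptions: the statistics are computed over whatever list of high scoring hits is passed in, the population standard deviation is used, and the tuple layout and the name fragmentary_hits are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Tuple

# (number of HSPs, MSP score, significant according to SW or FASTA)
Hit = Tuple[int, float, bool]


def fragmentary_hits(hits: List[Hit]) -> List[bool]:
    """Flag BLAST hits that should be ignored as highly fragmentary.

    A hit is flagged when its number of HSPs exceeds mean + std, its MSP
    score is below mean - std, and SW/FASTA do not support it.
    """
    counts = [h[0] for h in hits]
    scores = [h[1] for h in hits]
    max_hsps = mean(counts) + pstdev(counts)
    min_msp = mean(scores) - pstdev(scores)
    return [n > max_hsps and s < min_msp and not supported
            for n, s, supported in hits]


hits = [(1, 100.0, True), (1, 90.0, True), (2, 95.0, False),
        (1, 105.0, True), (9, 40.0, False)]
flags = fragmentary_hits(hits)  # only the last hit is flagged
```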

Note that this is a directed graph, so it is not necessarily symmetric. Specifically, it may (and does) happen that there is an edge from protein A to protein B, but none in the reverse direction. Furthermore, even if both edges exist, their weights may differ. Therefore, our notion of a component is that of a strongly connected component. The partition into strongly connected components is thus more refined than the partition into connected components.
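
For readers who want to experiment, strongly connected components of a directed graph can be computed with Kosaraju's two-pass algorithm, sketched below. The text does not say which algorithm was used, so this is only one standard choice; the adjacency-list layout and the function name are hypothetical.

```python
from typing import Dict, List, Set


def strongly_connected_components(graph: Dict[str, List[str]]) -> List[Set[str]]:
    """Kosaraju's algorithm, iterative to avoid deep recursion."""
    vertices = set(graph)
    for targets in graph.values():
        vertices.update(targets)

    # Pass 1: depth-first search, recording vertices by finishing time.
    order: List[str] = []
    seen: Set[str] = set()
    for root in vertices:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(graph.get(root, [])))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph.get(nxt, []))))
                    break
            else:  # all successors explored: the node is finished
                order.append(node)
                stack.pop()

    # Reverse every edge.
    reverse: Dict[str, List[str]] = {v: [] for v in vertices}
    for src, targets in graph.items():
        for dst in targets:
            reverse[dst].append(src)

    # Pass 2: search the reversed graph in decreasing finishing time.
    components: List[Set[str]] = []
    assigned: Set[str] = set()
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        component: Set[str] = set()
        pending = [root]
        while pending:
            node = pending.pop()
            component.add(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    pending.append(prev)
        components.append(component)
    return components


# A and B point to each other; C is reachable from B but does not point back.
example = strongly_connected_components({"A": ["B"], "B": ["A", "C"], "C": []})
```

In the toy graph, A and B form one cluster while C stays on its own, even though all three would fall into a single component if edge directions were ignored.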

This analysis can be performed at different thresholds, or confidence levels, to obtain a hierarchical organization. Several connected components of a given threshold may fuse together at a more permissive threshold. The analysis starts at the 10^-100 threshold. Subsequent runs are carried out for 10^-95, 10^-90, ..., 10^0 = 1.
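
The stepwise procedure can be illustrated with two small helpers: one that lists the confidence levels, and one that reports which clusters of a stricter level fuse at a more permissive level. This is a sketch under assumptions: the levels are taken to be spaced five orders of magnitude apart, clusters are represented as Python sets, and both function names are hypothetical.

```python
from typing import List, Set, Tuple


def confidence_exponents(start: int = -100, stop: int = 0, step: int = 5) -> List[int]:
    """Exponents e of the thresholds 10^e, from strictest to most permissive."""
    return list(range(start, stop + 1, step))


def fusions(fine: List[Set[str]],
            coarse: List[Set[str]]) -> List[Tuple[Set[str], List[Set[str]]]]:
    """For each coarse cluster, list the fine clusters that fused into it.

    Only coarse clusters that absorb more than one fine cluster are returned.
    """
    merged = []
    for big in coarse:
        parts = [small for small in fine if small <= big]
        if len(parts) > 1:
            merged.append((big, parts))
    return merged


levels = confidence_exponents()
strict = [{"a", "b"}, {"c"}, {"d"}]        # clusters at a strict threshold
permissive = [{"a", "b", "c"}, {"d"}]      # clusters after relaxing it
events = fusions(strict, permissive)
```

Because relaxing the threshold only adds edges, every cluster of a strict level is contained in exactly one cluster of a more permissive level, which is what makes the subset test above sufficient.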
4 Results



Almost all of the clusters we found are meaningful. Some correspond to well known families, but many others correspond to less studied families. There are clusters that consist exclusively of unknown proteins or hypothetical proteins. An interactive web site including the results of our analysis has been constructed (http://protomap.cs.huji.ac.il). This site will help the user get acquainted with this new map of the protein space.

Table 1 shows the distribution of cluster sizes at the different confidence levels. At each level, the universe of all proteins splits into connected components (clusters). These clusters become larger and coarser with the decrease of confidence levels. Consequently, the number of isolated proteins (clusters of size 1) decreases. Note the sharp decline in the number of midsize clusters as the confidence level decreases to 10^0. Chance similarities tend to blur the picture, and cause an "avalanche", where many (possibly unrelated) families are joined into a few giant clusters.

In what follows we focus only on a small number of examples. The examples are based on the observation that connected components of a given threshold may fuse together at a more permissive threshold. This fusion reflects the existence of sub-families within a family, or families within a super-family.



4.1 Hierarchical organization within protein families

In the next two examples we propose hierarchical organization within known families. This organization is based on the information extracted while moving across the different levels of the tree.

⁶ A directed graph is strongly connected if for every two vertices x, y there is a directed path from x to y as well as from y to x.






Table 1. Distribution of clusters by their size at each confidence level. (Scanning the hierarchy over all levels.) [Table body not recoverable from the scan; the legible column headers include the size ranges 3-5 and 6-10.]

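
A distribution of this kind can be tabulated from a list of cluster sizes as sketched below. Only the size ranges 3-5 and 6-10 are legible in the table; the remaining bins, the constant SIZE_BINS and the function name are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Sequence, Tuple

Bin = Tuple[int, Optional[int]]  # inclusive bounds; None means "no upper bound"

# Hypothetical binning, apart from the 3-5 and 6-10 ranges.
SIZE_BINS: Sequence[Bin] = [(1, 1), (2, 2), (3, 5), (6, 10), (11, 20),
                            (21, 50), (51, 100), (101, None)]


def size_distribution(cluster_sizes: Iterable[int],
                      bins: Sequence[Bin] = SIZE_BINS) -> Dict[Bin, int]:
    """Count how many clusters fall into each size range."""
    counts: Counter = Counter()
    for size in cluster_sizes:
        for low, high in bins:
            if size >= low and (high is None or size <= high):
                counts[(low, high)] += 1
                break
    return {b: counts[b] for b in bins}


# One row of the table corresponds to one confidence level.
row = size_distribution([1, 1, 2, 4, 7, 7, 300])
```
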
4.1.1 The small G-protein/Ras super family

The ras gene is one of a family of genes that have been found in tumor virus genomes, and [...] the viral oncogene is closely related to a cellular counterpart (called proto-oncogene). [...] a retrovirus carrying a [...] form of the ras gene (an oncogene), or mutations, can cause cell transformation. Indeed, mutations in the ras gene are linked to many human cancers.

ras proteins bind guanine nucleotide and possess a GTPase [...] to the regulation of cell [...], survival and differentiation. In the last decade many additional proteins related to ras were discovered. They all share the guanine nucleotide binding [...] and are 20-30 kDa in size. They are referred to as the small-G-protein superfamily [25].

This family of proteins is composed of few sub-families: ras, rab, ran, rho, [...]. [...] to ras, these proteins participate in cell [...] (rab) and cytoskeleton organization (rho). In Fig. 2 [...] based on the hierarchical organization obtained by our analysis: a total of 886 proteins, [...] are presented. Small [...] to subfamilies, are formed at the high levels of confidence, and fuse to larger clusters when the threshold is lowered. At the low level of confidence of 10^-10, this family unites with the ADP-ribosylation factors family and guanine nucleotide-binding proteins, all of which are GTP-binding proteins, to form one cluster. The homology with other G proteins can be traced after screening the chance similarities at the level of 10^0 (work in progress).

4.1.2 The ATP-binding transporters family

Transporters are membranous elements which provide the mechanism by which components cross the lipid bilayer within cell compartments and from its environment. The large variety of components, environmental conditions and organisms makes this super-family very complex and rich [26]. This diverged family is another example for which a hierarchical organization is proposed (Fig. 3).

The sub-classification distinguishes between amino-acid transporters, oligopeptide transporters, metal transporters, multidrug resistance proteins, and many more subgroups. Out of 296 proteins presented here, 75 are hypothetical transporters which can be classified, based on this organization, according to their position in the tree.

4.2 Relations between protein families 

In the next three examples we demonstrate how the transitivity can be used to verify the relation between different, but functionally related, protein families. In these examples we focus on the connections created when moving from one confidence level to the next level.

4.2.1 Super-family of motor proteins

In some cases the connection between functionally related proteins is revealed only through the connection with hypothetical proteins. One such example is the connection between the myosins and the kinesins (Fig. 4). The isolated sets (at the level of [...]) of kinesin and kinesin-like proteins (sets C1, D1, D2), myosins (set A1), axoneme-associated proteins (set C2), and trichohyalin (set B2), were grouped together, in some cases via connections with hypothetical proteins (sets B1, C3), to form a super family of motor proteins (at the level of 10^-30) with a total of 120 proteins. All proteins share elongated structure and energy dependent motor activity, and are expressed throughout the evolutionary tree. They do vary in their directionality of action, tissue specificity and their highly diverged biological contexts.
4.2.2 Proteins involved in the biosynthesis of complex sugars

Chitin synthase and cellulose synthase, which play a major role in cell wall biogenesis, nodulation proteins, which are involved in the synthesis of a [...] saccharide, and succinoglycan biosynthesis proteins, which are involved in the exopolysaccharide biosynthesis, all share biological activity in complex sugars biosynthesis.

Indeed, a relation between those families is established in our organization by lowering the confidence level from 10^-10 to 10^-5 (Fig. 5). As in the previous examples, the connection is established via hypothetical proteins, based on weak alignments (Fig. 6). However, the basic biological feature which characterizes all these proteins makes the connections inevitable.

4.2.3 Methylases and methyltransferases

This family is another example for the importance of hypothetical proteins as linkage proteins. Through such links a natural connection between related biological families is established, as demonstrated for methylases and methyltransferases (Fig. 7).

A total of [...] proteins in 28 isolated sets (at the level of 10^-10) were connected to one cluster at the level of 10^-5. The Y-axis of the graph represents 11 orders of transitivity (labeled A-K). At the bottom end (A1, B1, B2, C1) as well as at the top end ([...]2) methylases and methyltransferases are common. Many hypothetical proteins are scattered within these clusters.

Some clusters contain exclusively hypothetical proteins (E2, H1, I2). The connection between the two ends of this graph is made through such clusters (H1, I2). The connections are based on very sparse pairwise alignments, and raise a reasonable doubt on the biological significance (Fig. 8). Yet, proteins at the two ends of the graph exhibit close biological function, and therefore verify the validity of these connections.

5 Discussion 

In this paper we address the problem of identifying high order features within the sequence space. We begin by representing the sequence space as a weighted directed graph. We explore the properties of this graph to obtain a better understanding of the space of all protein sequences. We [...] strongly connected components to explore the constituents of this space, and their correspondence with known biological families. [...] at different thresholds, to obtain a hierarchical organization. This organization reveals interesting relations between and within protein families.

Two kinds of connections between families become apparent: (i) Multi-domain proteins, each domain of which is associated with a different protein family, or (ii) Connections through proteins much of whose sequence is similar to two distinct families. The latter may be considered linkers or ancestor proteins (examples of the two types will be described elsewhere).

An interactive web site including the results of our analysis has been constructed (http://protomap.cs.huji.ac.il). At this version we chose not to eliminate any potential connections. It would be interesting to receive users' feedback on which are real connections and which are artifacts that ought to be discarded.
Figure 2: The small G-protein family. This family is composed of few subfamilies. The composition can be revealed based on the hierarchical organization we obtained. A total of 366 proteins are grouped together into isolated sets at different levels of confidence, to form a natural sub-classification within the family. At the level of 10^-10 this family is linked with the ADP-ribosylation factors family and guanine nucleotide-binding proteins.

Figure 3: The ATP-binding transporters family.

Figure 4: Super-family of motor proteins. Each circle stands for a connected component at threshold 10^-4[...]. Circles' radii are proportional to the component's size. The component's size appears next to the corresponding circle. The curved edges appeared upon raising the threshold to 10^-30. The letters A-D indicate the order of transitivity (vertical axis). Each cluster is referred to by its order of transitivity (A-D), and its position from left to right. Circles which consist solely of hypothetical proteins are doubly circled.
[Alignment text illegible in the scan.]

Figure 6: Alignment of [...] and [...]. The proteins are classified to sets A1 and B3 respectively, in Fig. 5.

20% identity in 120 aa overlap

Alignment parameters: blosum62 matrix, gap penalties -12, -2

Figure 8: Alignment of [...] and [...]. The proteins are classified to sets C1 and H1 respectively, in Fig. 7.